    Application-level Fault Tolerance and Resilience in HPC Applications

    Official Doctoral Programme in Information Technology Research (Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información). 524V01 [Abstract] The computational needs of the different branches of science have grown enormously in recent years, which has driven a large increase in the performance delivered by supercomputers. Ever-larger high performance computing systems are being built, with more hardware resources of different types, so the failure rates of these systems also grow. Efficient fault tolerance techniques are therefore essential, not only to ensure that scientific applications complete their execution but also to save energy. Checkpoint/restart is one of the most popular fault tolerance techniques. However, most of the research in this field over the last decades has focused on stop-and-restart strategies for distributed-memory applications in the event of fail-stop failures. This thesis focuses on the implementation of application-level checkpoint/restart solutions for the most popular parallel programming models used in HPC. We have implemented checkpointing solutions to cope with fail-stop failures in hybrid MPI-OpenMP applications and in heterogeneous OpenCL-based programs. Both strategies maximize restart portability and malleability, i.e., the recovery can take place on machines with different CPU/accelerator architectures and/or operating systems, and can be adapted to the available resources (number of cores/accelerators). Regarding distributed-memory applications, we propose a resilience solution that can be generally applied to SPMD MPI programs. Resilient applications can detect and react to fail-stop failures without aborting their execution. Instead, failed processes are re-spawned, and the application state is recovered through a global rollback. Moreover, we have optimized this resilience proposal by implementing a local rollback protocol, in which only the failed processes roll back to a previous state, while message logging ensures global consistency and further progress of the computation. Finally, we have extended a checkpointing library to facilitate the implementation of ad hoc recovery strategies in the event of soft errors caused by memory corruptions. Many times, these errors can be handled at the software level, thus avoiding fail-stop failures and enabling a more efficient recovery.
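
The application-level checkpoint/restart idea described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the JSON file format and the simulated failure are all assumptions chosen for illustration. The application itself decides which variables make up its state, saves them periodically in a portable format, and resumes from the last saved state after a fail-stop failure.

```python
import json
import os
import tempfile


def run(n_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Sum of squares 0..n_steps-1 with periodic application-level checkpoints."""
    # Restart: if a checkpoint exists, resume from the saved state.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["total"] += state["step"] ** 2
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            # Write to a temporary file and rename it, so a failure during
            # the write never leaves a truncated checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["total"]


def demo():
    """Fail once at step 57, then restart from the checkpoint taken at step 50."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.ckpt")
        try:
            run(100, path, fail_at=57)
        except RuntimeError:
            pass  # the first execution is lost; only the checkpoint survives
        return run(100, path)
```

Because only the variables needed to resume are stored, and in a text format, the checkpoint in this sketch is independent of the machine that wrote it, which is the property the abstract refers to as restart portability.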


    Initial Teacher Education (ITE) must be studied in line with the changing and diverse reality of schools. The qualification of future early childhood teachers will depend on their mastery of theoretical and practical knowledge and the importance they attach to having this knowledge available in their professional practice. The present study analyses the degree of mastery and relevance that students give to the competences in attention to diversity acquired during their university education and the perceived training needs. Under a mixed methods research approach and a sequential explanatory design, the Professional Profiles Questionnaire of the Degree in Early Childhood Education was designed and applied to 141 students of the University of A Coruña, in the 3rd and 4th year. A positive appraisal of the ITE in attention to diversity was observed, although the degree of mastery was lower than the relevance given in all the dimensions studied.

    Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1863-z [Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The proposal of User Level Failure Mitigation (ULFM) for the inclusion of resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and they transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) programs. The evaluation focuses on the scalability of the two solutions, comparing both proposals using up to 3072 cores. Ministerio de Economía y Competitividad; TIN2013-42148-P. Ministerio de Economía y Competitividad; BES-2014-068066. Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05

    Resilience of Parallel Applications

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. Future exascale systems are predicted to be formed by millions of cores. This is a great opportunity for HPC applications; however, it is also a hazard for the completion of their execution. Even if each computation node fails only once per century, a machine with 100,000 nodes will encounter a failure roughly every 9 hours. Thus, HPC applications need to make use of fault tolerance techniques to ensure they successfully finish their execution. This PhD thesis is focused on fault tolerance solutions for generic parallel applications, more specifically on checkpointing solutions. We have extended CPPC, an MPI application-level portable checkpointing tool developed in our research group, to work with OpenMP applications and hybrid MPI-OpenMP applications. Currently, we are working on transparently obtaining resilient MPI applications, that is, applications that are able to recover themselves from failures without stopping their execution. European Cooperation in Science and Technology. COST. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P, and the predoctoral grant of Nuria Losada ref. BES-2014-068066) and by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
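
The failure-interval estimate in this abstract (one failure per node per century, 100,000 nodes) can be checked with a few lines of arithmetic. The sketch below assumes independent nodes with identical, constant failure rates, so that the failure rates simply add up; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def system_mtbf_hours(node_mtbf_years, n_nodes):
    """Mean time between failures of the whole machine, in hours.

    Assumes independent nodes with identical, constant failure rates,
    so the system failure rate is n_nodes times the node failure rate.
    """
    return node_mtbf_years * HOURS_PER_YEAR / n_nodes


# One failure per node per century on a 100,000-node machine.
print(system_mtbf_hours(100, 100_000))
```

Under these assumptions the result is 8.76 hours, consistent with the figure of about 9 hours quoted in the abstract.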

    Preschool Teachers and Inclusive Education: Competences and Needs

    School well-being is attracting growing interest in educational research and practice, although its multidimensional nature and imprecise definition limit what is known about it and call for more in-depth study. The aim was to deepen the understanding of the construct by identifying perspectives, models and defining elements. A systematic review of 53 documents from international databases (APA, PsycInfo, ERIC, Scopus, WoS) was conducted. The PICO approach was used to formulate the eligibility criteria and search for research questions, and the PRISMA recommendations for systematic reviews were followed. Peer-reviewed articles and conference papers from the fields of education and psychology were included, published between 2000 and 2020 in English, French, Portuguese or Spanish, with the keywords school and wellbeing or well-being. Topics related to health and illness, work, university, and social, economic, political or cultural issues were excluded. The information was analysed descriptively using meta-narrative. The characteristics of the studies (methodology, participants, years of publication and countries) were presented; the perspectives classically linked to school well-being as a subjective (hedonic) and psychological (eudaimonic) concept were explained, incorporating social well-being; and the factors that operationalize it were identified. Limitations are raised relating to the evidence included (publication bias, use of publications on general well-being) and to the review process (language filter). In short, the subjective, psychological and social components should receive differentiated but interconnected attention, overcoming the restrictive view of previous studies and enabling the development of integrative educational proposals that promote school well-being.

    Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

    [Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks, showing that most of the applications present small checkpoint overheads. Ministerio de Economía y Competitividad; TIN2013-42148-

    Towards Ad Hoc Recovery for Soft Errors

    The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, high failure rates on these systems will jeopardize the successful completion of their execution. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms, such as error correcting code (ECC), can detect and correct single-bit errors and can detect some multi-bit errors, while others go undetected. Unfortunately, detected multi-bit errors will most of the time force the termination of the application and lead to a global restart. Thus, other strategies at the software level are needed to tolerate this type of fault more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction in recovery times of up to 14%. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO). This research is also supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2016-75845-P and the predoctoral grant of Nuria Losada ref. BES-2014-068066), and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04). Peer Reviewed. Postprint (author's final draft)
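
To make the idea of local forward recovery concrete, the following sketch shows one possible ad hoc strategy on a toy problem. It does not use the FTI API and is not one of the benchmarks evaluated in the paper: the Jacobi solver, the `repair_from_neighbours` handler and the way the error is injected are all assumptions for illustration. When a corrupted value is detected, it is rebuilt from neighbouring data and the computation continues, instead of rolling every process back to a checkpoint.

```python
import math


def jacobi_step(u):
    """One Jacobi sweep for the 1D Laplace equation with fixed end points."""
    interior = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def repair_from_neighbours(u, i):
    """Hypothetical ad hoc handler: rebuild interior entry i from its neighbours."""
    u[i] = (u[i - 1] + u[i + 1]) / 2


def solve(n=16, steps=2000, corrupt_at=None, corrupt_index=5):
    """Iterate towards the steady state, optionally injecting one soft error."""
    u = [0.0] * (n - 1) + [1.0]
    for step in range(steps):
        if step == corrupt_at:
            # Stands in for a detected, uncorrectable memory error.
            u[corrupt_index] = math.nan
            # Forward recovery: repair in place and keep going, no rollback.
            repair_from_neighbours(u, corrupt_index)
        u = jacobi_step(u)
    return u
```

This kind of repair is only acceptable for applications that tolerate a temporarily inexact value, as an iterative solver does; codes with stricter reliability requirements would need a different handler, for example reloading only the affected data from a checkpoint.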

    Resilient MPI applications using an application-level checkpointing framework and ULFM

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7 [Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI Forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing scheme, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes. Ministerio de Economía y Competitividad; TIN2013-42148-P. Ministerio de Economía y Competitividad; TIN2014-53522-REDT. Ministerio de Economía y Competitividad; BES-2014-068066. Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05
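
The multithreaded multilevel checkpointing mentioned in this abstract can be sketched as follows. This is not CPPC's implementation: the class, its method names and the choice of two levels (process memory and a local file) are assumptions made for illustration. The application keeps a fast volatile copy, a helper thread writes a persistent copy in the background, and recovery uses the fastest level that survived.

```python
import json
import os
import tempfile
import threading


class MultiLevelCheckpoint:
    """Two-level checkpoint store: volatile memory plus a file on disk."""

    def __init__(self, directory):
        self._memory = None                                # level 1: fast, volatile
        self._path = os.path.join(directory, "ckpt.json")  # level 2: slower, persistent
        self._writer = None

    def save(self, state):
        snapshot = json.dumps(state)  # serialise once, share between levels
        self._memory = snapshot
        self.wait()                   # allow at most one disk write in flight
        self._writer = threading.Thread(target=self._write_disk, args=(snapshot,))
        self._writer.start()          # the application continues while this runs

    def _write_disk(self, snapshot):
        tmp = self._path + ".tmp"
        with open(tmp, "w") as f:
            f.write(snapshot)
        os.replace(tmp, self._path)   # atomic rename: never a half-written file

    def wait(self):
        if self._writer is not None:
            self._writer.join()

    def drop_memory(self):
        """Simulate losing the volatile level, e.g. because the process died."""
        self._memory = None

    def load(self):
        if self._memory is not None:  # fastest surviving level first
            return json.loads(self._memory)
        self.wait()
        if os.path.exists(self._path):
            with open(self._path) as f:
                return json.load(f)
        return None


def demo():
    with tempfile.TemporaryDirectory() as d:
        ckpt = MultiLevelCheckpoint(d)
        before = ckpt.load()          # nothing saved yet
        ckpt.save({"step": 3})
        from_memory = ckpt.load()
        ckpt.drop_memory()
        from_disk = ckpt.load()       # falls back to the persistent level
        return before, from_memory, from_disk
```

Moving the slow write to a helper thread keeps it off the application's critical path, which is one plausible reason such a scheme scales better than writing every checkpoint synchronously to shared storage.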

    A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications

    [Abstract] Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications, allowing them to survive fail-stop failures in the host CPU or in any of the accelerators used. In addition, applications can be restarted changing the host CPU and/or the accelerator device architecture, and adapting the computation to the number of devices available during recovery. The proposed solution is built combining CPPC (ComPiler for Portable Checkpointing), an application-level checkpointing tool, and HPL (Heterogeneous Programming Library), a library that facilitates the development of OpenCL-based applications. Experimental results show the low overhead introduced by the proposal and prove its portability and adaptability benefits. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2013-42148-P, TIN2016-75845-P and the predoctoral Grant of Nuria Losada Ref. BES-2014-068066), by EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and by the Galician Government (Xunta de Galicia) and FEDER funds of the EU under the Consolidation Program of Competitive Research (Ref. GRC2013/055). Xunta de Galicia; GRC 2013/05
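
The adaptability described in this abstract, restarting on a different number of devices, depends on the checkpoint not being tied to the original work distribution. The sketch below illustrates that idea only; it uses neither CPPC nor HPL, and the `partition` and `restart` helpers and the checkpoint layout are hypothetical. The checkpoint records the problem size, and the distribution is recomputed at restart for however many devices are available.

```python
def partition(n_items, n_devices):
    """Split range(n_items) into n_devices contiguous, near-equal chunks."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    base, extra = divmod(n_items, n_devices)
    chunks, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)  # the first `extra` devices get one more item
        chunks.append(range(start, start + size))
        start += size
    return chunks


def restart(checkpoint, n_devices_available):
    """Rebuild the work distribution from a device-independent checkpoint."""
    return partition(checkpoint["n_items"], n_devices_available)


# Checkpoint taken while running on 4 devices, restart on the 3 that remain.
checkpoint = {"n_items": 10}
for device, chunk in enumerate(restart(checkpoint, 3)):
    print(device, list(chunk))
```

With 10 items and 3 devices, the chunks hold 4, 3 and 3 items, and together they cover every item exactly once.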